Skip to content

Faster primality test between 65 and 128 bits#2624

Merged
fredrik-johansson merged 5 commits intoflintlib:mainfrom
fredrik-johansson:isprime81
Apr 4, 2026
Merged

Faster primality test between 65 and 128 bits#2624
fredrik-johansson merged 5 commits intoflintlib:mainfrom
fredrik-johansson:isprime81

Conversation

@fredrik-johansson
Copy link
Copy Markdown
Collaborator

@fredrik-johansson fredrik-johansson commented Apr 2, 2026

We improve fmpz_is_prime for input between 65 and 81 bits, more precisely for $2^{64} \le n < 3317044064679887385961981$, with the addition of a new function n_ll_is_prime.

The speedup is around 4x for random input, 5x for primes, and 4.5x for composites without small factors. Also fmpz_is_probabprime is sped up nearly 2x by using the same algorithm which is now faster than the previous probabilistic BPSW test (though it could be sped up further in the future by switching back to a more optimized BPSW).

We were already using the certified Sorenson-Webster Miller-Rabin test with 13 bases here which as far as I know is the best rigorous primality test for this range. However, this was previously just implemented using fmpz functions, and although the inner loop is in mpn_powm which is pretty decent (GMP has specializations for two-limb moduli), we can do better.

The whole primality test is now implemented using optimized "one-and-a-half-limb arithmetic", i.e. double-limb arithmetic where we exploit the fact that we have a lot of slack in the high limb. This allows doing most of the modular arithmetic on non-canonical residues in an extremely streamlined way (even more streamlined than Montgomery arithmetic): the squaring + mod n step in modular exponentiation is just

sqrlo_3x2x2
mulhi_2x2_sloppy
mullo_2x2
sub_ddmmss

which looks quite optimal, and the scalar multiplication by the small base is just

mullo_2x1

without the need for reduction mod n. Currently the modular arithmetic is entirely ad hoc for this primality test, but this could be used elsewhere, e.g. for the Lucas test to speed up fmpz_is_probabprime.

A second improvement is to do fused Miller-Rabin tests three bases at a time. No actual SIMD; we just interleave the operations of three modular exponentiations to improve instruction level parallelism. On Zen 3 this speeds up the modular exponentiations by about 1.8x.

(With AVX512 I expect that a true SIMD implementation doing all 12 odd bases simultaneously should be even faster. I'm less certain about AVX2 due to the lack of 64-bit multiplication. One could probably use pairs of double to represent ~100-bit integers for this algorithm,
but that would be much more messy to implement.)

A third improvement is to reimplement trial division using Granlund-Montgomery and to reduce the number of trial primes.

Detailed timings for fmpz_is_prime; average time in seconds per call testing $10^5$ random integers of the given bit size:

 bits     old        new     speedup
  65    4.16e-07      1e-07  4.160
  66    4.13e-07   1.02e-07  4.049
  67    4.07e-07   1.01e-07  4.030
  68    3.98e-07      1e-07  3.980
  69    4.13e-07   1.04e-07  3.971
  70    4.06e-07   1.05e-07  3.867
  71    4.08e-07   1.04e-07  3.923
  72    4.04e-07   1.04e-07  3.885
  73    4.07e-07   1.05e-07  3.876
  74    4.06e-07   1.05e-07  3.867
  75    4.06e-07   1.04e-07  3.904
  76    4.11e-07   1.06e-07  3.877
  77    4.14e-07   1.08e-07  3.833
  78    4.11e-07   1.07e-07  3.841
  79       4e-07   1.05e-07  3.810
  80    4.17e-07   1.09e-07  3.826
  81    4.13e-07   1.09e-07  3.789

Random primes as input:

 bits     old        new     speedup
  65   1.276e-05   2.64e-06  4.833
  66   1.299e-05   2.69e-06  4.829
  67   1.345e-05   2.73e-06  4.927
  68   1.389e-05   2.77e-06  5.014
  69   1.423e-05   2.78e-06  5.119
  70   1.422e-05   2.83e-06  5.025
  71   1.433e-05   2.85e-06  5.028
  72   1.449e-05    2.9e-06  4.997
  73   1.461e-05   2.93e-06  4.986
  74   1.478e-05   2.97e-06  4.976
  75    1.49e-05   3.01e-06  4.950
  76   1.506e-05   3.06e-06  4.922
  77   1.515e-05    3.1e-06  4.887
  78    1.53e-05   3.12e-06  4.904
  79   1.546e-05   3.16e-06  4.892
  80   1.562e-05   3.19e-06  4.897
  81   1.577e-05   3.23e-06  4.882

Composites without small factors:

 bits     old        new     speedup
  65    1.84e-06   3.64e-07  5.055
  66    1.79e-06    3.7e-07  4.838
  67    1.77e-06   3.72e-07  4.758
  68    1.78e-06   3.75e-07  4.747
  69    1.79e-06   3.79e-07  4.723
  70    1.79e-06   3.82e-07  4.686
  71    1.81e-06   3.89e-07  4.653
  72    1.83e-06   3.93e-07  4.656
  73    1.84e-06   3.99e-07  4.612
  74    1.85e-06   4.02e-07  4.602
  75    1.87e-06   4.08e-07  4.583
  76    1.89e-06   4.13e-07  4.576
  77    1.91e-06   4.17e-07  4.580
  78    1.91e-06   4.22e-07  4.526
  79    1.96e-06   4.26e-07  4.601
  80    1.84e-06   4.32e-07  4.259
  81    1.83e-06   4.37e-07  4.188

I was also curious to see how our primality test compares to primesieve ($n < 2^{64}$ only) and the relatively new implementation of the pseudosquares prime sieve by @kimwalisch for generating/counting all primes in a large interval.

                            primesieve                 n_is_prime
                       1 thread     8 threads      1 thread  8 threads
[2^63, 2^63 + 10^5]      0.632         -             0.0032  0.00059
[2^63, 2^63 + 10^6]      0.640         -             0.0324  0.0061
[2^63, 2^63 + 10^7]      0.678         -             0.326   0.058
[2^63, 2^63 + 10^8]      0.950         -             3.255   0.475
[2^63, 2^63 + 10^9]      2.146         -            32.746   4.755
[2^63, 2^63 + 10^10]     7.450       3.821
[2^63, 2^63 + 10^11]    57.274      17.829

                       pseudosquares_prime_sieve     n_ll_is_prime
                       1 thread     8 threads      1 thread  8 threads
[2^64, 2^64 + 10^5]      0.101       0.070          0.00897   0.0016
[2^64, 2^64 + 10^6]      0.392       0.117          0.0906    0.021
[2^64, 2^64 + 10^7]      3.469       0.571          0.915     0.152
[2^64, 2^64 + 10^8]     35.024       5.367          9.237     1.380
[2^64, 2^64 + 10^9]    356.652      57.522         93.247    14.232
[2^64, 2^64 + 10^10]               622.894                  156.178

For (exactly) 64-bit integers just calling n_is_prime in a loop is competitive with primesieve on intervals of width up to about $10^7$ single-threaded, but maybe up to about $10^9$ multithreaded, depending on how many cores you have (we obviously get nearly perfect linear speedup just doing completely independent primality tests in parallel).

For 65-bit integers n_ll_is_prime seems to be consistently at least 4x faster than pseudosquares_prime_sieve. The results look pretty similar for 81-bit integers. At least for the sizes I've tested there doesn't seem to be an asymptotic trend suggesting an eventual cutoff where pseudosquares_prime_sieve wins (only the hard 81-bit limit for n_ll_is_prime); if there is a cutoff, it would seem to be for intervals significantly wider than $10^{10}$.

Obviously a hybrid method where you replace independent trial divisions with partial sieving should do a bit better yet.

@fredrik-johansson
Copy link
Copy Markdown
Collaborator Author

fredrik-johansson commented Apr 2, 2026

We were already using the certified Sorenson-Webster Miller-Rabin test with 13 bases here which as far as I know is the best rigorous primality test for this range.

Actually https://arxiv.org/abs/1806.08697 suggests that BPSW is provably correct for $n < 2^{80}$ (at least if one can rule out numbers with four or more prime factors, e.g. with preliminary sieving). This could be about 3x faster for prime input. However, one would have to check very carefully that the specific parameters of the test match the hypotheses (this seems to be with a Fibonacci test while fmpz_is_probabprime_BPSW uses a Lucas test).

@fredrik-johansson
Copy link
Copy Markdown
Collaborator Author

I've fixed an embarrassing bug in the trial division in the first version and extended n_ll_is_prime to do a partial primality test (trial division + base-2 sprp) for all <=128-bit integers when the certified test isn't applicable. This allows the function to be used for all <=128-bit input to handle the easy cases in fmpz_is_prime and fmpz_is_probabprime.

New timings for all sizes:

random input
                   fmpz_is_prime                fmpz_is_probabprime
     bits     old        new     speedup     old       new      speedup

      65    4.12e-07   1.06e-07  3.887     2.49e-07   1.05e-07  2.371
      66    4.04e-07   1.06e-07  3.811     2.42e-07   1.04e-07  2.327
      67    4.06e-07   1.06e-07  3.830     2.43e-07   1.05e-07  2.314
      68    4.02e-07   1.06e-07  3.792      2.4e-07   1.05e-07  2.286
      69    3.98e-07   1.06e-07  3.755     2.38e-07   1.04e-07  2.288
      70    3.96e-07   1.06e-07  3.736     2.37e-07   1.05e-07  2.257
      71    3.97e-07   1.06e-07  3.745     2.38e-07   1.05e-07  2.267
      72    4.02e-07   1.07e-07  3.757      2.4e-07   1.07e-07  2.243
      73    3.86e-07   1.05e-07  3.676     2.34e-07   1.03e-07  2.272
      74    4.07e-07   1.09e-07  3.734     2.46e-07   1.08e-07  2.278
      75    3.95e-07   1.07e-07  3.692     2.42e-07   1.06e-07  2.283
      76    4.03e-07    1.1e-07  3.664     2.48e-07   1.08e-07  2.296
      77    3.89e-07   1.07e-07  3.636      2.4e-07   1.07e-07  2.243
      78    4.04e-07   1.11e-07  3.640      2.5e-07    1.1e-07  2.273
      79    4.09e-07   1.13e-07  3.619     2.52e-07   1.11e-07  2.270
      80    4.05e-07   1.13e-07  3.584     2.52e-07   1.11e-07  2.270
      81    3.97e-07    1.1e-07  3.609      2.5e-07    1.1e-07  2.273
      82    5.37e-06    5.2e-06  1.033     2.67e-07   1.25e-07  2.136
      83    8.52e-06   7.91e-06  1.077     2.57e-07   1.33e-07  1.932
      84    8.81e-06   8.66e-06  1.017     2.62e-07   1.37e-07  1.912
      88   1.026e-05  1.012e-05  1.014     2.63e-07   1.33e-07  1.977
      92   1.175e-05  1.159e-05  1.014     2.73e-07   1.39e-07  1.964
      96   1.412e-05  1.404e-05  1.006     2.78e-07   2.48e-07  1.121
     100   1.543e-05  1.537e-05  1.004     2.83e-07   2.56e-07  1.105
     104   1.604e-05  1.597e-05  1.004     2.85e-07   2.55e-07  1.118
     108   1.675e-05  1.665e-05  1.006      2.9e-07    2.6e-07  1.115
     112   1.903e-05  1.897e-05  1.003     2.96e-07   2.66e-07  1.113
     116   1.953e-05  1.947e-05  1.003     3.01e-07   2.68e-07  1.123
     120   1.989e-05  1.984e-05  1.003     3.04e-07   2.74e-07  1.109
     124   2.219e-05  2.211e-05  1.004     3.12e-07   2.83e-07  1.102
     128   2.327e-05   2.32e-05  1.003     3.42e-07   3.19e-07  1.072

prime input
                   fmpz_is_prime                fmpz_is_probabprime
     bits     old        new     speedup     old       new      speedup

      65   1.307e-05   2.65e-06  4.932     5.23e-06   2.67e-06  1.959
      66   1.264e-05   2.72e-06  4.647      5.2e-06    2.7e-06  1.926
      67   1.275e-05   2.73e-06  4.670     5.21e-06   2.73e-06  1.908
      68   1.283e-05   2.82e-06  4.550     5.23e-06   2.77e-06  1.888
      69   1.297e-05   2.79e-06  4.649     5.17e-06   2.81e-06  1.840
      70   1.305e-05   2.83e-06  4.611     5.21e-06   2.84e-06  1.835
      71   1.316e-05   2.87e-06  4.585     5.22e-06   2.88e-06  1.812
      72   1.328e-05    2.9e-06  4.579     5.27e-06   2.91e-06  1.811
      73    1.34e-05   2.94e-06  4.558     5.35e-06   2.94e-06  1.820
      74   1.352e-05   2.99e-06  4.522     5.41e-06   2.99e-06  1.809
      75   1.365e-05   3.03e-06  4.505     5.45e-06   3.02e-06  1.805
      76   1.381e-05   3.06e-06  4.513     5.52e-06   3.07e-06  1.798
      77   1.397e-05    3.1e-06  4.506     5.61e-06   3.09e-06  1.816
      78   1.409e-05   3.14e-06  4.487     5.63e-06   3.15e-06  1.787
      79    1.42e-05   3.17e-06  4.479     5.68e-06   3.18e-06  1.786
      80   1.433e-05   3.22e-06  4.450     5.75e-06   3.22e-06  1.786
      81    1.45e-05   3.25e-06  4.462     5.83e-06   3.25e-06  1.794
      82   0.0002827  0.0002776  1.018     5.93e-06   3.99e-06  1.486
      83   0.0004645  0.0004634  1.002     6.04e-06   4.51e-06  1.339
      84   0.0005034  0.0004977  1.011     6.04e-06   4.54e-06  1.330
      88   0.0006151  0.0006135  1.003      6.3e-06    4.7e-06  1.340
      92   0.0007202  0.0007183  1.003     6.56e-06   4.88e-06  1.344
      96   0.0008837  0.0008828  1.001      6.8e-06   6.27e-06  1.085
     100   0.0010334  0.0010278  1.005     6.96e-06   6.46e-06  1.077
     104   0.0010925  0.0010905  1.002     7.26e-06   6.74e-06  1.077
     108   0.0012138   0.001214  1.000     7.49e-06   6.92e-06  1.082
     112   0.0013908  0.0013896  1.001     7.72e-06   7.11e-06  1.086
     116   0.0015291  0.0015305  0.999     7.97e-06   7.35e-06  1.084
     120   0.0016927  0.0016883  1.003      8.2e-06   7.66e-06  1.070
     124   0.0018437  0.0018399  1.002     8.45e-06   7.85e-06  1.076
     128   0.0020064  0.0020104  0.998     9.01e-06   8.39e-06  1.074

hard composites
                   fmpz_is_prime                fmpz_is_probabprime
     bits     old        new     speedup     old       new      speedup

      65    1.67e-06   3.73e-07  4.477     1.65e-06   3.71e-07  4.447
      66    1.65e-06   3.79e-07  4.354     1.61e-06   3.77e-07  4.271
      67    1.63e-06    3.8e-07  4.289      1.6e-06   3.78e-07  4.233
      68    1.63e-06   3.85e-07  4.234      1.6e-06   3.82e-07  4.188
      69    1.66e-06   3.88e-07  4.278     1.61e-06   3.85e-07  4.182
      70    1.64e-06   3.92e-07  4.184     1.62e-06    3.9e-07  4.154
      71    1.67e-06   3.97e-07  4.207     1.63e-06   3.95e-07  4.127
      72    1.69e-06   4.02e-07  4.204     1.65e-06      4e-07  4.125
      73    1.71e-06   4.07e-07  4.201     1.67e-06   4.03e-07  4.144
      74    1.71e-06    4.1e-07  4.171     1.69e-06   4.08e-07  4.142
      75    1.72e-06   4.15e-07  4.145      1.7e-06   4.14e-07  4.106
      76    1.74e-06   4.22e-07  4.123     1.72e-06    4.2e-07  4.095
      77    1.76e-06   4.24e-07  4.151     1.72e-06   4.23e-07  4.066
      78    1.77e-06   4.31e-07  4.107     1.73e-06   4.27e-07  4.052
      79    1.78e-06   4.34e-07  4.101     1.75e-06   4.33e-07  4.042
      80    1.81e-06    4.4e-07  4.114     1.77e-06   4.38e-07  4.041
      81    1.81e-06   4.45e-07  4.067     1.78e-06   4.42e-07  4.027
      82    1.83e-06    4.5e-07  4.067      1.8e-06   4.49e-07  4.009
      83    1.84e-06   4.53e-07  4.062     1.82e-06   4.51e-07  4.035
      84    1.87e-06   4.61e-07  4.056     1.82e-06   4.58e-07  3.974
      88    1.94e-06   4.79e-07  4.050     1.88e-06   4.76e-07  3.950
      92    1.99e-06   4.97e-07  4.004     1.96e-06   4.96e-07  3.952
      96    2.05e-06   1.59e-06  1.289        2e-06   1.58e-06  1.266
     100    2.11e-06   1.65e-06  1.279     2.06e-06   1.64e-06  1.256
     104    2.17e-06   1.69e-06  1.284     2.14e-06   1.69e-06  1.266
     108    2.23e-06   1.74e-06  1.282     2.19e-06   1.74e-06  1.259
     112    2.27e-06    1.8e-06  1.261     2.24e-06    1.8e-06  1.244
     116    2.32e-06   1.86e-06  1.247      2.3e-06   1.86e-06  1.237
     120    2.38e-06   1.91e-06  1.246     2.37e-06   1.91e-06  1.241
     124    2.45e-06      2e-06  1.225     2.43e-06   1.99e-06  1.221
     128     2.9e-06   2.42e-06  1.198     2.82e-06   2.39e-06  1.180

The improvement for fmpz_is_prime is pretty negligible between 82 and 128 bits since actually certifying candidate primes dominates everything else, but the speedup for fmpz_is_probabprime is not bad.

@fredrik-johansson fredrik-johansson changed the title Faster primality test between 65 and 81 bits Faster primality test between 65 and 128 bits Apr 3, 2026
@fredrik-johansson fredrik-johansson merged commit 616028f into flintlib:main Apr 4, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant